# TU DELFT

## ET4171 PROCESSOR DESIGN PROJECT



# Report

Authors:

Henrique Dantas (4172922) Luca Feltrin (4270355)

July 2, 2013

# Contents

| Li | st of       | Figures                                     | ii |
|----|-------------|---------------------------------------------|----|
| Li | ${f st}$ of | Tables                                      | ii |
| 1  | Intr        | roduction                                   | 1  |
| 2  | Bas         | eline Analysis and Working Plan             | 2  |
| 3  | Imp         | proved Arithmetic Cores                     | 5  |
|    | 3.1         | Multiplier                                  | 6  |
|    |             | 3.1.1 Wallace Tree Multiplier               | 6  |
|    |             | 3.1.2 Advantages and Disadvantages          |    |
|    |             | 3.1.3 Implementation and Simulation Results |    |
|    | 3.2         | Divider                                     |    |
| 4  | Res         |                                             | 16 |
|    | 4.1         | Synthesis                                   | 16 |
|    | 4.2         | Benchmark Scores                            | 17 |
|    | 4.3         | Metrics Comparison                          | 18 |
| 5  | Cor         | aclusions and further improvements          | 20 |

# List of Figures

| 3.1 | Resulting tree after executing step 1 for and 8 bit by 8 bit   |    |
|-----|----------------------------------------------------------------|----|
|     | multiplication                                                 | 6  |
| 3.2 | Wallace tree structure before and after the reorganization     | 7  |
| 3.3 | Screenshot of the Modelsim's Wave for the multiplier. For this |    |
|     | particular simulation a 32x32 operation is shown, along with   |    |
|     | all the appropriate signals                                    | 9  |
| 3.4 | Divider State Diagram                                          | 11 |
| 3.5 | Divider Block Diagram (during Core Computation, State $= 1$ ). | 12 |
| 3.6 | Radix-4 P-D Plot (Red Dots are Exceptions)                     | 13 |
| 3.7 | Detail of SubGen (d multiples generator based of the qDigit).  | 14 |
| 3.8 | Signal Dump Of Radix-4 Divider Vs. Original Divider            | 15 |
|     |                                                                |    |

# List of Tables

| 4.1 | Resource Utilization Baseline                    | 16 |
|-----|--------------------------------------------------|----|
| 4.2 | Resource Utilization Modified Version            | 16 |
| 4.3 | Benchmarks Scores Baseline                       | 17 |
| 4.4 | Benchmarks Scores Modified Version               | 18 |
| 4.5 | Final Metrics For Baseline And Improved Versions | 19 |

### 1. Introduction

For the Processor Design Project course we have been asked to improve the performance of the LEON3, a 32-bit SPARC V8 processor designed for embedded applications. Our main target is to decrease the computation time for certain benchmarks keeping the power consumption as low as possible, thus for us the most relevant compound metric is the power×benchmark score (P×BS).

The SPARC V8 architecture contemplates the use of instruction and special hardware for integer multiplications and divisions, but with the original configuration the multiplier takes 5 clock cycles to calculate the result and the divider 36, so one of the first things we decided to do is to improve these arithmetic cores. Simple algorithms can be implemented to obtain significant improvements.

The division is not a very common operation, even if in the baseline version is quite slow and can be improved, but the multiplication on the other hand is very common and apart from the calculation for the application software, can be also used to calculate addresses for array access, and so almost every kind of benchmark we are going to run on the processor, as well as the operative system, can take benefit by an improvement, this if of course the compiler is smart enough to use the dedicated istruction when needed.

In section 2 an analysis of the current baseline of the processor is done in order to find the weak points that will be improved. Then in section 3 the improvement of the arithmetic unit, the multiplier and divider, is discussed. In section 4 the results from the simulation, synthesis and the FPGA board are presented and discussed, it's also discussed our methodology to compare different designs based on some metrics. In the end some further improvements of the processor are suggested for future works.

## 2. Baseline Analysis and Working Plan

In order to find out which modification will augment the processor's performance first an analysis of the baseline version is needed. By looking at the documentation and the results of the synthesis of the baseline version we divided all the potential modifications in three categories:

- Arithmetic improvements
- Clock frequency improvement
- Architectural improvements

As explained before the SPARC V8 architecture used by the LEON3 processor contemplates several assembly instructions for complex operations such as multiplication and division. The integer unit of the baseline version is equipped with 2 units that can execute hardware multiplications and divisions, but the performance in terms of execution time of these ones are quite poor. Even if they don't consume too much area with the default settings the multiplication takes 5 clock cycles and the division 36, simple more advanced algorithms can be implemented to make these execution times lower. Before deciding to spend time doing this improvement it's necessary to evaulate the gain in terms of the compound metric we are caring of: P×BS.

Of course the resulting circuits will be more complex so most likely the power consumption will be greater, but the baseline version's units don't use particular tricks to make the power consumption low, they use carry propagate adders in their computation which are not very performant in terms of power and path delay, so there is room to improve these aspects as well or to have in the end a processor where the power consumption is not much worse.

It's important also to study how often these istruction are used by an average program. Assuming that the given benchmarks represent the average usage of the processor's resources, it's possible to see that the division is widely used only in a couple of those: "Division" and "GMPbench, divide" while in the other benchmarks is only rarely used. The gain in terms of benchmark score for the division won't be very high even if a significant gain is expected in the division-related benchmarks. On the other hand, as explained in the next section, a simple division algorithm can be implemented to have the division executed in half of the time of the baseline version, this without having a too big exacerbation of the power peformance so we decided to spend time on the division unit.

The multiplication, oppositely to the division, is very widely used by all kind of benchmarks, in particular the unit can be used also to calculate addresses by the operative system running on the FPGA board so by improving this unit we expect a general improvement of all the benchmark scores.

————HENRIQUE: MAYBE WRITE SOMETHING MORE HERE ABOUT WHY IT'S GOOD TO IMPROVE THE MULTIPLIER————

About the clock-frequency-related modifications after a synthesis of the baseline version, we noticed that the slowest path of the whole processor is located in the DRAM controller (from "ddrsp0.ddrc0/ddr32.ddrc/ra.raddr\_0" to "ddrsp0.ddrc0/ddr\_phy0/ddr\_phy0/xc4v.ddr\_phy0/casgen"). Trying to modify this slowest path would be most likely very difficult because first the delay in the circuit is affected by the place&routing process, so after the re-design of the arithmetics unit this path could change, second the documentation is very leaky about the implementation details and the available time was our most tight constrain, so in the end we decided to postpone our decision after we had a working version with the other modifications in order to see how the slowest path changed in the meantime.

About the architecture, there are several different improvements that can be done. The baseline version is equipped with a static branch predictor, this can augment performance nicely but better algorithms exist such as 1 or 2 bit branch prediction buffer or correlation prediction. An improvement in this sense would be only slight but present in every kind of benchmark: branches are the 30The static prediction doesn't contemplate heavy calculations, indeed it's very simple so the power consumption in the baseline version due to the prediction is negligible. Usually more evoluted algorithms still have not so heavy computation, for the branch prediction buffer the algorithm consist in a small state machine, but even a slight exacerbation can be dangerous because more calculations are done for each branch fetched. We think that still the advantage in terms of execution time would be greater than the disadvantage so this could be a very nice modification to do. There are others modification that can be done, like designing a scalable architecture such as Tomasulo's, but this kind of changes require a really deep change of the whole architecture, the code for the baseline version of the integer unit is more than 3000 lines of code, and so it's very unlikely to have a working scalar integer unit in the small time we have for this project so we decided to not dedicate time for that and concentrate in other things.

In the end we focused mainly on the improvement of the arithmetic unit because we thought this would have been the most efficient way to spend our time, because it would have given us the most improvements with the less efford. We scheduled the other modification after we had a working improved multiplication and division.

## 3. Improved Arithmetic Cores

In order to improve the performance of the arithmetic unit we redesigned from the ground up both the two multiplier and divider units.

In order to make them compatible with the rest of the processor we studied in a detailed way all the handshaking signals.

The original multiplier can be configured for "2 cycles" of latency (instead of 5) through the use of make xconfig. It is important to note that for this particular setup the ready signal is not used. So although we are writing a new multiplier it is necessary to configure the processor with a 2-cycle multiplier as well, so the processor can handle the handshaking signals generated by our multiplier in the correct way.

For the divider there is not a previous configuration, so the processor knows that an operation is completed by inspecting the ready and nready signals which have been reproduced following the specifications.

The part of the core that handles the other signals such as start, flush or holdn, has been designed to mimic the original version, thereafter all the handshaking signals are handled and generated following the specifications to enable unit compatibility with the processor.

A more detailed description of the handshaking signals is presented later.

### 3.1. Multiplier

#### 3.1.1. Wallace Tree Multiplier

For this project it was decided the most appropriate multiplier scheme would be the Wallace Tree Multiplier. The most important reason for this choice is due to its great performance although at the cost of gates and area.

The Wallace tree is a regular hardware structure to multiply two operands. It was invented by Chris Wallace, an Australian Computer Scientist in 1964. The algorithm can be divided in three major steps:

1. The initial AND operation between all combinations of bits of each operand. The weights must be adjusted according to the location of the operands, just like in the classical pen-and-paper algorithm. The resulting tree, using dot notation, is shown in figure 3.1.



Figure 3.1: Resulting tree after executing step 1 for and 8 bit by 8 bit multiplication.

2. Thereafter the tree must be reduced through the use of half adders and full adders. These will convert each two or three "dots", respectively, into one and a carry out for the following column. This step shall be iterated sufficient times until only two numbers emerge.

3. Finally, the two remaining numbers can be summed with a conventional adder. The width of the result should be equal to the sum of the widths of the original operators. For example for a 32 bit times 32 bit operation, the result shall be 64 bits wide.

For our particular implementation the aforementioned description was modified to accommodate signed numbers through the use of the modified Baugh-Wooley algorithm. The details of this alteration will be explained in section 3.1.3.

#### 3.1.2. Advantages and Disadvantages

The main advantages of this scheme are the speed obtained and regular mapping in hardware. The structure of the tree is consistent throughout the several steps. On the other hand, this design is costly in the amount of gates necessary and total area. The latter can be improved by reorganizing the tree for efficiency. The 'waste' in the traditional structure is particularly noticeable on the extremes of each line of the tree, as shown in figure 3.2a. [3] details possible techniques that can be implemented to tackle this issue and reduce the area cost of the Wallace multiplier. The main idea is to split the tree into two overlapping trees, hence saving area. Thereafter the additions take place in opposite directions. However due to lack of time it was not possible to implement the proposed ideas on the multiplier.



(a) Detail of Wallace Tree from [3]. It is clearly visible a significant percentage of unused area.



(b) Modified Wallace Tree from [3]. The image reflects the better area utilization based on the algorithm present in the paper.

Figure 3.2: Wallace tree structure before and after the reorganization.

#### 3.1.3. Implementation and Simulation Results

This section will be used to introduce the reader to the implementation requirements for the multiplier. Followed by a simple, high-level description of the architecture used to accomplish them.

The implementation tries to closely mimic the three algorithm steps. However additional signals are necessary for the correct interface with the processor. According to the documentation [4] and [5], there are 3 input signals, RST, CLK and MULI and 1 output signal MULO. Moreover, there are 4 generics: infer, multype, pipe and mac. MULI includes the 32 bit operands (along with an extra signal bit), and severals flag to request flushing of the current operation, indicate signed multiplication, to initiate, and to start multiply and accumulate. On the other hand MULO includes a self explanatory ready signal, a nready signal (not used), condition codes that reflect if the result is zero or negative and finally the 64 bit result.

The most significant generic is multype that configures the multiplier to different operand's sizes. These are 16x16, 32x8, 32x16 and 32x32. It is important to note that a discrepancy between the documentation and the actual code exists regarding the multiplier. The infer generic has been replace by tech (related to the target architecture) in the original VHDL file: mul32.vhd. Our custom multiplier replicates this configuration.

As mentioned in 3.1.1 the algorithm can be divided in three major parts. On part 1 the operands are ANDed together and their weights are adsted. To reflect this a three dimensional STD\_LOGIC\_VECTOR was created

justed. To reflect this a three dimensional STD\_LOGIC\_VECTOR was created and named WallaceTree. The first dimension reflects the number of stages (or levels) necessary to finish the operation. These values are computed offline for the four multype possibilities. The second dimension relates to the number of 'lines' of the matrix, or its height. The range is always equal to the width of the first operand. Finally, the third dimension accounts for the number of columns, or its width, which is necessarily equal to the sum of the width's of each operand. In practice the 3D array was converted to 1D as synthesis tools have better support for the latter, nonetheless the same principles stand.

To be able to multiply signed operands the modified Baugh-Wooley multiplication was used (*cf.* [2]). Therefore some values are complemented if the input is signed. Moreover on the final step a constant value is added to the two operands. This action completes the first step.

Thereafter each group of 3 or 2 bits must be compressed to 1, using full-adders and half-adders respectively. If an odd number of bits exist in a certain column the last bit is transferred to the next level. Repeating this process sufficient times<sup>1</sup>, the height of the matrix is reduced to 2. These can be added with a conventional adder to determine the final result.

To help with these operations the exact number of full-adders and half-adders for each column of each stage were precomputed. Moreover two other constants arrays exist that indicate the number of carry ins each column receives and if it has a remainder bit (odd numbered of lines). Having this information the location of the appropriate inputs (x, y and carry in for the full adder) and outputs (s and carry out) are mapped to a full (or half) adder instance to determine the result.

Step 3 is simply an addition with the first two 'lines' of the last stage of the WallaceTree signal. However, if the multiplication is signed an extra STD\_LOGIC\_VECTOR is added to these, as explained in [2].

The extra signals indicate the result is ready and the condition codes are updated to reflect the result of the operation. The evolution of the signals driven by our testbench is shown in 3.3.



Figure 3.3: Screenshot of the Modelsim's Wave for the multiplier. For this particular simulation a 32x32 operation is shown, along with all the appropriate signals.

<sup>&</sup>lt;sup>1</sup>The exact number of iterations required can be inferred from the integer constant levels previously mentioned.

#### 3.2. Divider

The algorithm implemented in the original version of the processor is one of the simplest but the slowest available: a radix-2 non restoring division.

Several other algorithms can compute the division faster but all of them present disadvantages that must be taken into account according to the target application.

Algorithms like repeated multiplication or reciprocation are fast but require a significant amount of area because they require a whole multiplier to work which is alone a pretty big circuit. Similarly an array divider would have been very fast only If we had control on the place&routing process in order to create a regular structure, something that in FPGAs we cannot control so we discarded that option.

In the end we decided to implement a simple radix-4 division algorithm for simplicity of implementation and of the circuit itself.

Using an higher radix could have improved performance but the size of the lookup table required by the algorithm would have increased again the area consumption. The area itself is not a big problem since as we said we care more about power consumption and execution time, but a larger circuit means also harder computations and so more energy consumption.

The divider consist in a state machine (its diagram is shown in figure 3.4) which checks if the inputs will generate an overflow by doing a trial subtraction between the divisor and the MSBs of the dividend, and performs a preliminary shift to put the divisor in the appropriate range in order to match the right range for the look-up table to work.



Figure 3.4: Divider State Diagram.

In the figure above is shown also when the ready signals are asserted, from the documentation we see that the ready signal must be set the cycle before the result is stable so in the last state of the state machine because once the machine returns in the idle state the registers are "freezed" and so the results is stable, and the nready signal must be asserted 3 cycles before the results is stable, and after doing some math this cycle is when during the core computation the counter's value is 14 ("1110"). By observing these signals the integer unit can find out when the operation is finished in order to eventually fetch another instruction or to make the pipeline go one step further.

After that, the real computation begins and lasts 16 clock cycles. The block diagram of the divider while it's in this state is shown in figure 3.5.



Figure 3.5: Divider Block Diagram (during Core Computation, State = 1).

The algorithm is very similar to the original radix-2 version but in this case the partial reminder (x) is shifted by 2 bits every cycle and the circuit has to guess the quotient digit from the range [-3,3]. "qSelector" is the lookup table which performs the quotient digit guessing and it's based on the p-d plot of the radix-4 SRT division shown in figure 3.6. In case of unsigned division only the right half of the p-d plot is being used.

The quotient digits are in a radix-4 redundant format so a conversion to binary format is needed. The conversion is performed gradually every cycle by the 32-bit adder "CPAq" which shift and sum each generated digit with the already calculated quotient.

An addition/subtraction in Carry-Save format would have been much faster and also easier for both the quotient and the partial reminder, but the selection of the quotient digit would have required the analysis of the most significant bits of both the sum and the carry of the reminder (x) making the lookup table several orders of magnitude bigger and so consuming lots of area. In our divider "SubGen" generates the multiple of d to sum with the current partial reminder in a carry save format, all this operands are being compressed by a 3-2 compressor (1 full adder of delay) and finally the new partial reminder is calculated with a 35-bit adder, "CPAx". Using this architecture is possible to calculate easely the triple multiples of d allowing us to have a simpler p-d plot and so a smaller look-up table, in particular the -3d, which is the most difficult, is calculated as the negate of 2d concatenated

with a 1 plus the negate of d and the signal t is setted to 1 so that can be summed twice in the compressor and as the carry input of CPAx, in this way its weight is 2 and so can complete the calculation of the 2s complement of 2d.

The principle behind the calculation of the multiples is explained better in figure 3.7.



Figure 3.6: Radix-4 P-D Plot (Red Dots are Exceptions).



Figure 3.7: Detail of SubGen (d multiples generator based of the qDigit).

Other solutions have been analyzed, such as having a lookup table only for unsigned division, half of the size of the final version, and handle the sign separately but the synthesis has shown that the resources utilization would have not changed significantly while one more cycle would have been needed. Hence, we decided to keep the current divider.

At last, in order to check the compatibility of the radix-4 divider with the rest of the system, and its correct functioning, we designed a testbench. In this testbench we placed both the old version and the new version of the divider and the inputs are routed to the two units equally, we expect to have the same outputs but in a shorter time. Of course it's prohibitive to test the unit for each possible inputs so we found a set of critical operations. There are some categories of operations in this case, simple signed/unsigned division such as 350/100 or -350/100 which can be performed with the signed signal asserted or not, operations involving exceptions in the p-d plot, and operations generating overflows. The first category is not critical but just to be sure and have a complete testbench we implemented some of them. As shown in figure 3.6, there are a couple of exceptions in the p-d plot. These exceptions are handled by the qSelector apart from all the other cases with

dedicated lines of code (simple ifs). For instance one of these exceptions is when the divisor is "0100" and the dividend is "111000" followed by all zeros (otherwise it would not be an exception because the point would be inside the uncertanties area and not in the bottom left corner), in this case the quotient digit must be -2 instead of -1 so one of these operations could be simply -16/4, for this operation after the pre-shift the input of qSelector would be exactly these ones. For the operations involving overflows there are several possibilities where the dividend is very high and the divisor very low, enough to make the result greater than  $2^{32}$ . There are also some operations that when considered to be signed don't generate overflows while if unsigned they do. This happen when the most significant bit of the dividend is one, its weight is  $2^{63}$  when unsigned and  $2^{64}$  when signed and so it can change a lot the absolute value of the result.

All of this kind of operations have been analyzed in a testbench, a part of it in figure 3.8, and the result is always the same as the baseline version of the divider. The only different thing is that when there is an overflow the result of the radix-4 divider is different from the baseline, sometimes the simulator puts it as "X" because in an overflow the qSelector works in some of the not allowed areas, but the overflow flag is asserted normally so the processor won't consider the actual result and so this is not a problem.



Figure 3.8: Signal Dump Of Radix-4 Divider Vs. Original Divider.

### 4. Results

### 4.1. Synthesis

In order to evaluate the performance of our improved processor we need first to evaluate the performance of the baseline version of the processor.

The synthesis tool reported the values shown in table 4.1 for the resources utilization.

| Timing Summary (max clock freq. [ MHz ]) | 80.522  |
|------------------------------------------|---------|
| # of Occupied Slices                     | 9904    |
| Total $\#$ of 4-input LUTs               | 16889   |
| Quiescent power [ W ]                    | 2.467   |
| Dynamic power [ W ]                      | 0.721   |
| Total power [ W ]                        | 3.188   |
| P/f [ $W/MHz$ ]                          | 0.03959 |

Table 4.1: Resource Utilization Baseline.

Notice that the value "P/f" indicates the energetic efficiency of the processor. Another useful insight is the fact that the power consumption is almost proportional to the clock frequency, therfore we can use this value to estimate the power consumption at different clock frequencies.

From the synthesis report we can also see the slowest path which determines the max clock frequency.

This path is from ''ddrsp0.ddrc0/ddr32.ddrc/ra.raddr\_0'' to ''ddrsp0.ddrc0/ddr\_phy0/ddr\_phy0/xc4v.ddr\_phy0/casgen''.

Those components belong to the SDRAM controller and the path is located between the controller and the physical interface with the memory.

The synthesis of our modified version gave us the results showed in table 4.2:

| Timing Summary (max clock freq. [ MHz ]) | 40.197  |
|------------------------------------------|---------|
| # of Occupied Slices                     | 11886   |
| Total $\#$ of 4-input LUTs               | 20469   |
| Quiescent power [ W ]                    | 2.511   |
| Dynamic power [ W ]                      | 0.832   |
| Total power [ W ]                        | 3.343   |
| P/f [ $W/MHz$ ]                          | 0.08317 |

Table 4.2: Resource Utilization Modified Version.

As expected the area consumption is moderately higher than before as the units we designed are more complex than the baseline. In particular the lookup table in the divider, and the Wallace tree structure in the multiplier are very space-hungry.

Also the power consumption increased for the same reason, the algorithms are more complicated so more energy is expended for all the calculations. However this trade-off was envisioned at the start of the project, and as we will see later the benchmark's performance improved.

The most significant change is in the maximum clock frequency which was drastically reduced. The main culprit is the multiplier which only takes one clock cycle to finish. Originally, the unit was designed to complete after two cycles. However it was verified that when the processor is configured with a two-cycle latency, it expects the 32x32 operation to finish in one cycle. Thus the multiplier had to be adjusted for this behavior.

Alternatively, the mul32 unit could be adapted to the remaining multiplier configuration (i.e. 4-cycle latency) so the maximum frequency would not be so severely affected.

#### 4.2. Benchmark Scores

Once the processor is synthesized and has been loaded in a FPGA board, Linux is initiated on the processor along with several benchmarks. In table 4.3 the execution times of these benchmarks are reported, for detailed scores please see the accompanying Excel file.

| Stanford [ s ]                                   | 2.30   |
|--------------------------------------------------|--------|
| Whetstone [ s ]                                  | 116.2  |
| Gmpbench Multiply [ $\mathrm{Op}/\mathrm{\ s}$ ] | 781    |
| Gmpbench Divide [ $\mathrm{Op}/\mathrm{\ s}$ ]   | 15876  |
| Gmpbench RSA [ $\mathrm{Op}/\mathrm{\ s}$ ]      | 5123   |
| Division [ s ]                                   | 8.06   |
| Mibench JPEG (average) [ s ]                     | 23.215 |
| SSD [s]                                          | 10.59  |
| Total [ s ]                                      | 219.28 |

Table 4.3: Benchmarks Scores Baseline.

The scores obtained with the modified processor are reported in table 4.4.

| Stanford [s]                                     | 2.21   |
|--------------------------------------------------|--------|
| Whetstone [ s ]                                  | 112.08 |
| Gmpbench Multiply [ $\mathrm{Op}/\mathrm{\ s}$ ] | 914    |
| Gmpbench Divide [ $\mathrm{Op}/\mathrm{\ s}$ ]   | 19205  |
| Gmpbench RSA [ $\mathrm{Op}/\mathrm{\ s}$ ]      | 5353   |
| Division [ s ]                                   | 7.31   |
| Mibench JPEG (average) [ s ]                     | 21.76  |
| SSD [ s ]                                        | 8.60   |
| Total [ s ]                                      | 206.92 |

Table 4.4: Benchmarks Scores Modified Version.

From these results we can see that almost every benchmark had a healthy improvement, in the end the total execution time improved by 5.6%.

These scores are not as good as expected but likely the execution of the operating system on the processor causes a non negligible overhead in the execution due to the scheduling. As an example, the divider takes about half the time to execute but the execution time of the division related benchmarks is only about 10% better.

## 4.3. Metrics Comparison

In order to get a fair comparison between the baseline and the improved processor some standard metrics have to be calculated and studied.

Usually these metrics are A, the area consumption here calculated as the weighted sum of the number of occupied slices and the number of 4-input LUTs used where the weight is the reciprocal of the number of available resources, D is the delay or the reciprocal of the max clock frequency and indicate the delay of the slowest path, P is the power and it's simply calculated as the total power consumed by the Dhrystone benchmark used for the simulation and BS is the benchmark score which indicate how fast a program can be executed, it's calculated as the total execution time of the benchmarks on the FPGA board.

For the area we are using that complex formula because in order to have a parameter which indicate the true cost of our implementation we needed something more than just the number of resources used. We assumed that the number of resource available on the device is inverse proportional to their effective cost, so if we use their reciprocal as a weight to calculate the average, we obtain a metric which is most likely nearer to the effective cost of the implementation once implemented on chip.

Moreover some composite metrics can be observed: these metrics consider two or more primitive metrics and often are more interesting than the latter because any improvement in one metric is usually accompanied by decrease in other. Composite metrics show the overall performance, and give insight on the trade-offs.

Since we want to speed-up the execution of the software while keeping the power consumption low because this is a processor designed for embedded applications, the composite metric we are interested the most is  $P \times BS$ . It reflects how much the processor is able to compute for the same amount of energy.

Additional composite metrics are  $A \times D$ ,  $A \times BS$  and  $P \times D$ . Since we focused on the execution time and power consumption one can notice that the other metrics are worse in our version compared to the baseline.

All the baseline's and modified units' synthesis and benchmark results have been condensed in table 4.5.

Due to the important impact the multiplier has on these metrics, results are shown before and after this custom unit is integrated.

|                            |                                                                             | Primitive                                                                                        | e Metrics                                                                                       |                                                                              |
|----------------------------|-----------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------|
| Version                    | Α                                                                           | D                                                                                                | P                                                                                               | BS                                                                           |
| Baseline                   | $2.68 \times 10^{4}$                                                        | $1.24 \times 10^{-2}$                                                                            | 3.19                                                                                            | $2.19 \times 10^{2}$                                                         |
| Modified (Div)             | $2.83 \times 10^{4}$                                                        | $1.24 \times 10^{-2}$                                                                            | 3.21                                                                                            | $2.13 \times 10^{2}$                                                         |
| Modified (Mul&Div)         | $3.24 \times 10^{4}$                                                        | $2.49\times10^{-2}$                                                                              | 3.34                                                                                            | $2.07\times10^2$                                                             |
| Improvements (Div)         | $5.8\mathrm{s}\%$                                                           | -                                                                                                | 0.7%                                                                                            | -2.7%                                                                        |
| Improvements (Mul&Div)     | 21%                                                                         | 100%                                                                                             | 5%                                                                                              | -6%                                                                          |
|                            |                                                                             |                                                                                                  |                                                                                                 |                                                                              |
|                            |                                                                             | Composit                                                                                         | te Metrics                                                                                      |                                                                              |
| Version                    | $A \times D$                                                                | Composit<br>A×BS                                                                                 | te Metrics<br>P×D                                                                               | $P \times BS$                                                                |
| Version<br>Baseline        | $\begin{array}{c} {\tt A} {\times} {\tt D} \\ 3.33 \times 10^2 \end{array}$ | -                                                                                                |                                                                                                 | $\begin{array}{c} \text{P}{\times}\text{BS} \\ 6.99 \times 10^2 \end{array}$ |
|                            |                                                                             | $A \times BS$                                                                                    | $P \times D$                                                                                    |                                                                              |
| Baseline                   | $3.33 \times 10^{2}$                                                        | $\begin{array}{c} \text{A} \times \text{BS} \\ 5.88 \times 10^6 \end{array}$                     | $\begin{array}{c} {\rm P}{\times}{\rm D} \\ 3.96\times10^{-2} \end{array}$                      | $6.99 \times 10^{2}$                                                         |
| Baseline<br>Modified (Div) | $3.33 \times 10^{2}$<br>$3.52 \times 10^{2}$                                | $\begin{array}{c} \text{A} \times \text{BS} \\ 5.88 \times 10^6 \\ 6.05 \times 10^6 \end{array}$ | $\begin{array}{c} {\rm P}{\times}{\rm D} \\ 3.96\times10^{-2} \\ 3.99\times10^{-2} \end{array}$ | $6.99 \times 10^2$<br>$6.85 \times 10^2$                                     |

Table 4.5: Final Metrics For Baseline And Improved Versions.

## 5. Conclusions and further improvements

The improvements made to the arithmetic units have improved the benchmark performance of the processor. Although they are modest, they reflect the chosen target scenario.

Because of lack of time we were unable to perform more changes, but of course there are many things to change in the architecture to improve further the performance.

Different configurations for the multiplier should have been tested to decrease its impact on power and maximum frequency.

The size and structure of the cache memory could be changed to decrease the probability of misses and consequently the benchmarks execution time. Although this could have been done using the configuration tool it would have not been fruit of our own merit. Moreover increasing the cache size probably would have increased also the power consumption, escalating even more this issue.

The LEON3 uses a static branch prediction in the integer unit, which is a good compromise between power consumption, as no difficult computation is needed, yet there are gains in terms of execution time. To improve performance further, a 1 bit or a 2 bit branch prediction buffer algorithm could be implemented. The calculations needed are more intricate and computed more often (30% of the instructions are branches) henceforth the power consumption would probably increase. On the other hand the gains in terms of execution time could make up for the loss.

In the end another heavy modification that could have been done is making the integer unit super-scalar and implementing Out of Order execution. Hypothetically it could yield significant performance improvements in terms of execution time. Notwithstanding a complete re-design of the integer unit would have been needed, which deemed it impossible to complete in the (small) time budget for this project.

## References

- [1] Parhami, Behrooz. Computer arithmetic: algorithms and hardware designs. Oxford University Press, Inc., 2009.
- [2] Parhami, Behrooz. Computer arithmetic: Part III, Multiplication. Available at http://www.ece.ucsb.edu/~parhami/text\_comp\_arit.htm# slides.
- [3] Bohsali, Mounir, and Michael Doan. "Rectangular Styled Wallace Tree Multipliers."
- [4] GRLIB IP Core User's Manual Version 1.1.0 B4104, November 2010
- [5] GRLIB IP Library User's Manual Version 1.1.0 B4100, October 1, 2010